ABSTRACT

Human-robot communication is often faced with the difficult
problem of interpreting ambiguous auditory data. For
example, the acoustic signals perceived by a humanoid with
its on-board microphones contain a mix of sounds such as
speech, music, and electronic devices, all in the presence of
attenuation and reverberations. In this paper we propose a novel
method, based on a generative probabilistic model and on
active binaural hearing, allowing a robot to robustly perform
sound-source separation and localization. We show how
interaural spectral cues can be used within a constrained mixture
model specifically designed to capture the richness of
the data gathered with two microphones mounted onto a
human-like artificial head. We describe in detail a novel
EM algorithm, we analyse its initialization, speed of convergence
and complexity, and we assess its performance with
both simulated and real data.

Categories and Subject Descriptors 

G.3 [Mathematics of Computing]: Probability and Statistics—Multivariate
statistics, Probabilistic algorithms; I.2.7
[Natural Language Processing]: Speech Recognition and
Synthesis

General Terms 

Algorithms, Theory, Experimentation 

Keywords 

Blind source separation, computational auditory scene analysis,
EM algorithm, learning

1. INTRODUCTION 

There is an increasing interest in robots able to communicate
with people in the most natural way, e.g., Fig. 1.
Such robots must be endowed with the ability to reliably
process and understand sensory inputs, e.g., visual or auditory
data, gathered in unconstrained physical situations.
Within the field of computational auditory scene analysis
(CASA) tremendous progress was made in speech recognition.
Nevertheless, current approaches work well with a
single sound source often recorded with a close-range microphone.
A more natural setup, e.g., humanoids with their
own on-board microphones, requires dealing with much more
complex and challenging acoustic inputs involving auditory
data of various kinds (speech, prosody, music, electronic devices,
etc.) and originating from sparsely located multiple
sound sources, all in the presence of noise, attenuation and
reverberations.

Figure 1: A cocktail party robot should be able to communicate
with people in the most natural way. One fundamental task that
such a robot should be able to accomplish is to localize the speakers
in a room and to separate their emitted speech signals, all in
the presence of music, background noise, and reverberations.

A classical example illustrating the difficulty of modeling
such situations is the well known cocktail party problem
(CPP) [9]. While human listeners solve this problem routinely
and effortlessly, we note that it has not been properly
addressed from the perspective of robot audition. Two
key aspects of CPP are localization and separation of several
sound sources. We believe that principled solutions to
these problems are some of the prerequisites for addressing
higher-level tasks in HRI such as speech and music recognition,
verbal communication, dialog handling, etc.

In this paper we propose a new method for
sound-source separation and localization based on a generative
probabilistic model and on active binaural hearing.
More precisely, we promote a novel robot audition paradigm
based on interaural spectral features, namely the interaural
level difference (ILD) and the interaural phase difference
(IPD), and on a constrained mixture model specifically designed
to capture the richness of the binaural data recorded
with a robot head endowed with a human-like head-related
transfer function (HRTF). We formally derive an EM procedure
that alternates between separation (E-step) and localization
(M-step). We show how our system can be trained fully
automatically and efficiently using an audio-motor
map. We analyse our algorithm in detail and we assess its
performance with both simulated and real data.

There is behavioral and physiological evidence that humans
use binaural cues in order to infer the direction of
a sound. Two such cues seem to play an essential role,
namely the ILD (already mentioned) and the interaural time
difference (ITD). A number of computational models have
been recently developed for robust sound localization and
sound tracking based on ITD and/or ILD [19, 24]. However,
it is well known that the spatial information provided
by interaural-difference cues within a restricted band of
frequency is spatially ambiguous, particularly along a roughly
vertical and front/back dimension [15]. To avoid these ambiguities,
more accurate sound localization models incorporate
the HRTF, e.g., [12]. These approaches are based on the fact
that the particular shape of the head, pinna and torso acts as
a filter depending on the emitting 3D location of the sound
source (distance, azimuth, and elevation). However, HRTF
databases are subject-specific and room-dependent (noise,
reverberations, room geometry, etc.), and only a handful of
HRTF databases are available in practice [1, 21]. This makes
them hard to apply in a real robotic application. To
overcome these issues, the HRTF of a specific robot can be
automatically learnt using audio-motor maps [8, 10]. Such
maps are built by recording a static full-spectrum sound
source from different motor states of the robot.

The problem of sound source separation has been thoroughly
studied in the last decades and several interesting
approaches were proposed. For example, [4, 20] and many
others achieve separation with a single microphone, based on
known acoustic properties of speech signals, and are therefore
limited to a specific type of input. Other techniques
such as independent component analysis (ICA) [6] or
multi-microphone techniques require as many microphones as the
number of sources. Several other methods use binaural
localization cues for source separation [11, 14, 23]. In [11]
acoustic inputs at different frequency channels are clustered
over time by means of some assumptions on the emitted signals,
and an HRTF data look-up table is used to find their
corresponding positions in space. Once exact locations are
known, up to two sources can be separated using the HRTF
at each frequency channel.

Our method is based both on clustering (localization) and
on spectral masking (separation). Spectral masking, also
called binary masking, allows the separation of an arbitrary
number of sources from a mixed signal, with the only assumption
that a single source is active at every frequency-time
point (f, t). This is referred to as the W-disjoint
orthogonality assumption [25] and it has been shown to hold,
in general, for simultaneous speech signals. It is particularly
well suited for binaural recordings in realistic environments.
Recently, [14] proposed a probabilistic model for multiple
sound source separation based on interaural spatial cues.
For each sound source, a binary mask and a discrete distribution
over interaural time delays are provided. This can be
used to approximate the azimuth angle of the source with
a front-back ambiguity, if the distance between the microphones
is known.

It appears that there are very few methods that formally
combine 2D localization and separation. The first originality
of our approach is to formally cast the localization and
separation tasks into a generative probabilistic model that
is solved very efficiently with a novel EM algorithm. Full
details of this algorithm and its initialization as well as
favorable comparisons with recent binaural-based separation
methods are presented in a companion paper [7]. The second
originality of our approach is to use an active robotic
head in order to automatically learn audio-motor maps for
a very large set of sound-source directions located in the far
field. Such a training phase incorporates an implicit model
of the HRTF and leads to a high precision both in terms of
source separation and of localization. A thorough evaluation
of the method with simulated and real data is provided, and
puts forward this approach as a promising future tool for
auditory human-robot interactions.

The remainder of this paper is organized as follows: section 2
describes a binaural sound representation, section 3
presents in detail the formal model as well as its associated
EM algorithm, section 4 describes the data acquisition
and recording technique, and section 5 presents our validation
method, experiments and results. Concluding remarks and
directions for future work are discussed in section 6.

2. BINAURAL SOUND REPRESENTATION 

Both sound source localization and separation require a
proper representation of the perceived data. For localization,
one needs a content-independent representation that
contains as much spatial information as possible. For separation,
one needs a representation preserving all the richness
of the original signals. In this paper we put forward two binaural
representations, namely ILD and IPD spectrograms.

As already mentioned in section 1, our sound source separation
method is based on binary masking. This technique
consists in “filtering” the original signal by weighting
all frequency-time points corresponding to the target source
with 1 and all the other points with 0. Consequently, the
perceived signal needs to be described within a time-varying
spectral representation, a spectrogram. Spectrograms associated
with each one of the two microphones are computed using
a short-term fast Fourier transform (FFT) algorithm [2].
We use a time window of 64 ms with 8 ms overlap between two
consecutive windows, thus yielding T = 126 time-windows
for a signal lasting 1 s. Since sounds are recorded at a sample
rate of 24,000 Hz, each time-window contains 1,536 samples,
multiplied by a Hann window and padded with zeros, for a total
length of 2,048 samples. Each window is then transformed
via FFT to obtain complex coefficients of F = 1,024 positive
frequency channels between 0 and 12,000 Hz. We denote
with $s^{(k)}_{f,t} \in \mathbb{C}$ the spectrogram value at frequency-time
point (f, t) of a signal emitted by a sound source k and with
$s^{(L)}_{f,t}$ and $s^{(R)}_{f,t}$ the spectrogram values perceived by the left
and right microphones respectively. The spectrogram of the
emitted acoustic level is defined by:

$$a^{(k)}_{f,t} = 10 \log\left(|s^{(k)}_{f,t}|^2\right) \qquad (1)$$

while the spectrogram of the perceived acoustic level is defined by

$$a_{f,t} = 10 \log\left(|s^{(L)}_{f,t}|^2 + |s^{(R)}_{f,t}|^2\right) \qquad (2)$$

Fig. 2-(a) and (b) show an example of spectrograms of emitted
and perceived acoustic levels.

Figure 2: Spectrograms corresponding to a male utterance
emitted from the left-hand side of the binaural head.
Panels: (a) emitted spectrogram, (b) perceived spectrogram,
(c) ILD spectrogram, (d) IPD spectrogram.
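The spectrogram computation and the perceived acoustic level of (2) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names are hypothetical, the logarithm is assumed to be base 10, and the 8 ms figure is interpreted here as the shift between consecutive windows. The handling of signal edges is not specified in the text, so the number of frames produced by this sketch need not equal T = 126.

```python
import numpy as np

def spectrogram(signal, fs=24000, win_ms=64, hop_ms=8, n_fft=2048):
    """Short-term FFT with a Hann window, zero-padded to n_fft samples."""
    win_len = int(fs * win_ms / 1000)   # 1,536 samples at 24 kHz
    hop = int(fs * hop_ms / 1000)       # assumed shift between windows
    window = np.hanning(win_len)
    n_frames = 1 + (len(signal) - win_len) // hop
    frames = np.stack([signal[i * hop:i * hop + win_len] * window
                       for i in range(n_frames)])
    # rfft zero-pads each frame to n_fft; the Nyquist bin is dropped
    # so that F = n_fft / 2 frequency channels remain
    spec = np.fft.rfft(frames, n=n_fft, axis=1)[:, :n_fft // 2]
    return spec.T                        # shape (F, n_frames)

def perceived_level(s_left, s_right, eps=1e-12):
    """Perceived acoustic level as in (2); eps avoids log(0)."""
    power = np.abs(s_left) ** 2 + np.abs(s_right) ** 2
    return 10.0 * np.log10(power + eps)
```

Applying `spectrogram` to the left and right microphone signals gives the two complex arrays from which all interaural quantities are derived.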


The W-disjoint orthogonality assumption implies that a
single sound source emits at a given frequency-time point
(f, t). Let k be that source; the HRTF model provides a relationship
between emitted and perceived spectrogram points
at (f, t):

$$s^{(L)}_{f,t} = h^{(L)}(x_k, f)\, s^{(k)}_{f,t} \quad \text{and} \quad s^{(R)}_{f,t} = h^{(R)}(x_k, f)\, s^{(k)}_{f,t} \qquad (3)$$

where $x_k \in \mathbb{R}^3$ denotes the 3D position of sound source k in
a robot-centered coordinate frame (the origin of this frame
is the midpoint between the two microphones and the x, y,
and z axes point towards the left microphone, the head-top
and in front of the head) and $h^{(L)}$ and $h^{(R)}$ denote the
left and right HRTF. The latter functions depend on both
the emitter’s position and the frequency and they act as
linear filters on the emitted signal. The HRTFs are mainly
determined by the shape of the head, pinna and torso of the
listener, e.g., the robot-mounted dummy-head in our case.
The interaural transfer function (ITF) $I$ is defined by the
ratio between the left- and right-HRTF:

$$I(x_k, f) = \frac{h^{(R)}(x_k, f)}{h^{(L)}(x_k, f)} \qquad (4)$$

We can now define the interaural spectrogram as the ratio
between the left- and right-microphone spectrograms:

$$I_{f,t} = \frac{s^{(R)}_{f,t}}{s^{(L)}_{f,t}} \qquad (5)$$

From (3), (4) and (5) one may notice that if the source
k emitting at time-frequency point (f, t) is located at $x_k$
then $I_{f,t} \approx I(x_k, f)$. The equality is only approximate
due to the sensor noise and FFT errors. Therefore, at a
given frequency-time point, the interaural spectrogram value
$I_{f,t}$ does not depend on the emitted spectrogram value $s^{(k)}_{f,t}$
but only on the source position $x_k$. We finally define the
ILD spectrogram $\alpha$ and the IPD spectrogram $\phi$ as the
log-amplitude and phase of the complex interaural spectrogram
$I_{f,t}$, e.g., Fig. 2-(c) and (d):

$$\alpha_{f,t} = 20 \log |I_{f,t}| \in \mathbb{R} \quad \text{and} \quad \phi_{f,t} = \arg(I_{f,t}) \in\, ]-\pi, \pi] \qquad (6)$$
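As an illustration of (5) and (6), the ILD and IPD spectrograms can be obtained from the two complex microphone spectrograms in a few lines. This is a sketch under stated assumptions: the function name is hypothetical, the ratio is taken as right over left, the logarithm is base 10, and a small constant is added to avoid division by zero.

```python
import numpy as np

def interaural_cues(s_left, s_right, eps=1e-12):
    """ILD (dB) and IPD (radians) from the complex ratio s_right / s_left."""
    ratio = s_right / (s_left + eps)     # eps guards against division by zero
    ild = 20.0 * np.log10(np.abs(ratio) + eps)
    ipd = np.angle(ratio)                # np.angle returns values in (-pi, pi]
    return ild, ipd
```

Because the emitted value cancels in the ratio, both outputs depend (approximately) only on the source position, which is what makes them usable for localization.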

3. SEPARATION AND LOCALIZATION 

This section starts by describing how a training set of
interaural parameters can be built out of a sound dataset
annotated with source positions. After formally stating the
problem, we then introduce our novel probabilistic model for
interaural parameters in a mixture of point sound sources.
We finally detail a new version of the EM algorithm for
this model, which achieves simultaneous sound-source
separation and localization based on learned positions.

3.1 Building a Training Set 

We first explain how a training set of interaural parameters
can be built out of a sound dataset annotated with
source positions. Let $\mathcal{X} = \{x_n\}_{n=1}^{N}$ be a set of 3D coordinates
in the robot-centered frame that we already described
above. If a single sound source n emits white noise from
$x_n \in \mathcal{X}$, let $\{\alpha_{f,t}\}_{f=1,t=1}^{F,T}$ and $\{\phi_{f,t}\}_{f=1,t=1}^{F,T}$ be the perceived
ILD and IPD spectrograms. The mean ILD vector
$\mu(x_n) = (\mu_1 \ldots \mu_f \ldots \mu_F)^\top \in \mathbb{R}^F$ associated with $x_n$ is defined
by taking the temporal mean of $\alpha$ at each frequency
channel:

$$\mu_f(x_n) = \frac{1}{T} \sum_{t=1}^{T} \alpha_{f,t} \qquad (7)$$

Similarly, the mean IPD vector $\xi(x_n) = (\xi_1 \ldots \xi_f \ldots \xi_F)^\top$
is the temporal mean of the IPD spectrogram $\phi$ at each
frequency channel:

$$\xi_f(x_n) = \arg\left(\frac{1}{T} \sum_{t=1}^{T} e^{j\phi_{f,t}}\right) \in\, ]-\pi, \pi] \qquad (8)$$

The components of the mean IPD vector are estimated in
the complex domain in order to avoid problems due to phase
circularity, as suggested in [16]. White noise is used because
it contains equal power within a fixed bandwidth at any
center frequency: the source n is therefore the only source
emitting at each frequency-time point (f, t), and $\mu_f$ and
$\xi_f$ thus approximate the log-amplitude and phase of
$I(\cdot, f)$. The set $\mathcal{X}$ of 3D sound-source positions as well as
the mappings $\mu$ and $\xi$ will be referred to as the training data
to be used in conjunction with the sound-source separation
and localization algorithm described below.
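The two temporal means (7) and (8) can be sketched as below for one training position. The function name is hypothetical, and the spectrograms are assumed to be arrays of shape (F, T). The IPD mean is computed on the unit circle, so that phases close to $\pi$ and $-\pi$ average to a value near $\pm\pi$ rather than near 0.

```python
import numpy as np

def mean_interaural_vectors(ild, ipd):
    """Temporal means of ILD and IPD spectrograms of shape (F, T)."""
    mu = ild.mean(axis=1)                            # arithmetic mean, as in (7)
    xi = np.angle(np.exp(1j * ipd).mean(axis=1))     # circular mean, as in (8)
    return mu, xi
```

Repeating this for the white-noise recording made at every training position yields the two lookup tables used by the algorithm.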

3.2 Problem Formulation 

We suppose that there are K sound sources randomly located
in the robot’s environment and that these K sources
simultaneously emit sounds with unknown spectrograms from
the unknown locations $\{x_k\}_{k=1}^{K} \subset \mathcal{X}$. From the acoustic
inputs perceived by the robot’s microphone pair, one
can build the ILD and IPD spectrograms $\{\alpha_{f,t}\}_{f=1,t=1}^{F,T}$ and
$\{\phi_{f,t}\}_{f=1,t=1}^{F,T}$ as already described. The goal of the
sound-source separation and localization algorithm is to associate
each perceived frequency-time point (f, t) with one of the
sound sources and to estimate their 3D locations.

The observed $\alpha_{f,t}$ and $\phi_{f,t}$ are useful only if there is a
significant signal at (f, t). To identify such significant observations
we estimate the perceived acoustic level $a_{f,t}$ with
(2) and we retain only those frequency-time points for which
the acoustic level is above a threshold, $a_{f,t} > \epsilon_f$. One empirical
way to choose the thresholds (one at each frequency)
is to measure the highest perceived acoustic level at each f
in the absence of emitting sound sources. These thresholds
are typically very low compared to perceived acoustic levels
due to natural sounds, and make it possible to filter out
frequency-time points corresponding to the “room silence”. We call
level mask the associated spectral mask. We denote by $M_f$
the number of significant observations at f and by $\alpha_{f,m}$ and
$\phi_{f,m}$ the m-th significant ILD and IPD observations at f.
The observed data will be denoted by $A = \{\alpha_{f,m}\}_{f=1,m=1}^{F,M_f}$
and $\Phi = \{\phi_{f,m}\}_{f=1,m=1}^{F,M_f}$.
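The level mask can be sketched as a per-frequency threshold learned from a recording of room silence. The function name and the array layout (F, T) are assumptions made for illustration.

```python
import numpy as np

def level_mask(level, silence_level):
    """Keep points whose perceived level exceeds the per-frequency threshold.

    level:          perceived acoustic level of the mixture, shape (F, T)
    silence_level:  perceived acoustic level with no emitting source, shape (F, T')
    """
    # Threshold at each frequency: highest level observed during silence
    eps_f = silence_level.max(axis=1, keepdims=True)
    return level > eps_f                 # boolean mask, shape (F, T)
```

The number of significant observations at each frequency is then `mask.sum(axis=1)`.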

We also introduce the missing data, i.e., a set of unobserved
variables $z_{f,m} \in \{0,1\}^K$, such that $z_{f,m,k} = 1$ if observations
$\alpha_{f,m}$ and $\phi_{f,m}$ were generated by source k, and
$z_{f,m,k} = 0$ otherwise. The W-disjoint orthogonality constraint
can therefore be written as:

$$\sum_{k=1}^{K} z_{f,m,k} = 1 \qquad (9)$$

Variables $z_{f,m}$ are also called data-to-source assignments
at (f, m), and $M_k = \{z_{f,m,k}\}_{f=1,m=1}^{F,M_f}$ corresponds to the
binary spectral mask associated with sound-source k. Finally,
we denote with $Z = \{z_{f,m}\}_{f=1,m=1}^{F,M_f}$ the set of all
missing data. The problem of simultaneous localization and
separation amounts to estimating the sound-source locations
$\{x_k\}_{k=1}^{K}$ and the masking variables Z, given the number of
sound sources K and the observed data $A$ and $\Phi$.

3.3 A Constrained Mixture Model

We assume that both the ILD and IPD observed data are
drawn from normal distributions. The conditional likelihood
of the observation $\alpha_{f,m}$ given its assignment to sound source
k, i.e., $z_{f,m,k} = 1$, located at $x_k$ is therefore a
1D Gaussian distribution centered at $\mu_f(x_k)$ and with
variance $\sigma^2_{f,k}$:

$$P(\alpha_{f,m} \mid z_{f,m,k}, x_k, \sigma_{f,k}) = \mathcal{N}(\alpha_{f,m} \mid \mu_f(x_k), \sigma^2_{f,k}) = \frac{1}{(2\pi)^{1/2} \sigma_{f,k}} \exp\left(-\frac{(\alpha_{f,m} - \mu_f(x_k))^2}{2\sigma^2_{f,k}}\right) \qquad (10)$$

For simplicity, the expression $z_{f,m,k} = 1$ is replaced by
$z_{f,m,k}$ in probabilities. Since the IPD data lie on the circle
$]-\pi, \pi]$, they should be modeled by a circular normal distribution.
As proposed in [14], we approximate the wrapped
normal distribution with a 1D Gaussian. The conditional
likelihood of the observation $\phi_{f,m}$ given its assignment to
sound source k located at $x_k$ is given by:

$$P(\phi_{f,m} \mid z_{f,m,k}, x_k, \rho_{f,k}) = \tilde{\mathcal{N}}(\phi_{f,m} \mid \xi_f(x_k), \rho^2_{f,k}) = \frac{1}{(2\pi)^{1/2} \rho_{f,k}} \exp\left(-\frac{\Delta(\phi_{f,m}, \xi_f(x_k))^2}{2\rho^2_{f,k}}\right) \qquad (11)$$

where $\rho^2_{f,k}$ is the IPD variance associated with source k at
frequency f and the $\Delta$ function is defined by

$$\Delta(x, y) = \arg\left(e^{j(x-y)}\right) \in\, ]-\pi, \pi] \qquad (12)$$

The distribution $\tilde{\mathcal{N}}(\xi, \rho^2)$ approximates the normal distribution
on the circle $]-\pi, \pi]$ when $\rho$ is small relative to
$2\pi$. Preliminary experiments on IPD spectrograms of white
noise showed that this assumption holds in the general case.

Note that in our model, the source positions act as latent
constraints on Gaussian means and thus tie the observations
at different frequency channels. Such an approach can
also be used to tie auditory and visual observations, as
recently done in [13].

As emphasized in [14], the well known correlation between
ILD and IPD does not contradict the assumption that Gaussian
noises corrupting the observations are independent. Under
this assumption, the conditional likelihood of the observed
data $(\alpha_{f,m}, \phi_{f,m})$ given its assignment to source k
located at $x_k$ and with variances $\sigma^2_{f,k}$ and $\rho^2_{f,k}$ is

$$P(\alpha_{f,m}, \phi_{f,m} \mid z_{f,m,k}, x_k, \sigma_{f,k}, \rho_{f,k}) = \mathcal{N}(\alpha_{f,m} \mid \mu_f(x_k), \sigma^2_{f,k}) \cdot \tilde{\mathcal{N}}(\phi_{f,m} \mid \xi_f(x_k), \rho^2_{f,k}) \qquad (13)$$

The prior probability of assigning an observation to a sound
source is:

$$P(z_{f,m,k}) = \pi_{f,k} \qquad (14)$$

Finally, we denote by $\Theta$ the set of all the parameters of our
binaural mixture model:

$$\Theta = \left\{ \{x_k\};\ \{\pi_{f,k}\};\ \{\sigma^2_{f,k}\};\ \{\rho^2_{f,k}\} \right\}_{f=1,k=1}^{F,K} \qquad (15)$$
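To make the joint conditional likelihood (13) concrete, the following sketch evaluates its logarithm for one observation. The function names are hypothetical. The phase residual is wrapped with the $\Delta$ function of (12) before being passed to an ordinary Gaussian density, which is the approximation described above.

```python
import numpy as np

def wrap(x):
    """Map an angle difference to (-pi, pi], up to floating-point rounding."""
    return np.angle(np.exp(1j * x))

def log_gauss(residual, var):
    """Log-density of a zero-mean 1D Gaussian with variance var."""
    return -0.5 * (np.log(2.0 * np.pi * var) + residual ** 2 / var)

def joint_log_likelihood(alpha, phi, mu_f, xi_f, sigma2, rho2):
    """Log of (13): independent ILD and (wrapped) IPD terms."""
    ild_term = log_gauss(alpha - mu_f, sigma2)
    ipd_term = log_gauss(wrap(phi - xi_f), rho2)
    return ild_term + ipd_term
```

Working in the log domain avoids underflow when many observations are combined, a choice made for this sketch rather than one stated in the text.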

3.4 The Separation-Localization Algorithm 

The problem of both source separation and source localization
can now be expressed as an optimal parameter estimation
problem, namely the maximization of the observed-data
log-likelihood over the parameters $\Theta$:

$$\hat{\Theta} = \arg\max_{\Theta} \mathcal{L}(A, \Phi; \Theta) \qquad (16)$$

In order to keep the model as general as possible, there is no
assumption on the emitted sounds or on the way their
spectra are spread across the frequency-time points. Therefore,
we assume that all the observations are independent:

$$\mathcal{L}(A, \Phi; \Theta) = \log P(A, \Phi \mid \Theta) = \sum_{f=1}^{F} \sum_{m=1}^{M_f} \log P(\alpha_{f,m}, \phi_{f,m} \mid \Theta) \qquad (17)$$

It is well established that the direct optimization of (17)
is difficult because of the presence of many local maxima.
Therefore we recast the problem within the framework of
maximum likelihood with missing data, i.e., Z in (13), which
is traditionally solved via expectation-maximization (EM).
The EM algorithm alternates between estimating the expected
complete-data log-likelihood (E-step) using the current
model parameters and maximizing this likelihood over
the model parameters given the current posterior probabilities
(M-step). The posterior probability of $z_{f,m,k}$ is
defined by:

$$r_{f,m,k} = P(z_{f,m,k} \mid \alpha_{f,m}, \phi_{f,m}; \Theta) \qquad (18)$$

As will be detailed below, the E-step of the proposed algorithm
computes the posterior probabilities of assigning each
spectrogram point to a sound source k (separation-step),
while the M-step maximizes the expected complete-data
log-likelihood with respect to the model parameters $\Theta$, namely
the priors, the variances and, most notably, the source locations
$\{x_k\}_{k=1}^{K}$ (localization-step).

One of the most interesting properties of EM algorithms
is that the log-likelihood $\mathcal{L}$ in (16) is increased at each EM
iteration and hence it converges to a local maximum. An
outline of the proposed EM method is provided in Algorithm 1.
The finally estimated posterior probabilities make it possible
to compute the binary spectral masks $M_k$ associated with
each source k while the final parameters $\hat{x}_k$ provide estimates
for the source locations. $v^{(p)}$ denotes the value of
variable v at iteration p, and $\hat{v}$ denotes its final value.
The E-step, M-step, initialization and convergence check are
detailed below.


Algorithm 1 Separation-Localization EM
Input: $A$, $\Phi$, $\{\mu(x_n)\}_{n=1}^{N}$, $\{\xi(x_n)\}_{n=1}^{N}$, $K$
Output: $\{x_k^{(p)}\}_{k=1}^{K}$, $\{M_k\}_{k=1}^{K}$
$\Theta^{(0)}$ := initialize
$p$ := 0
while !converged do
    $p$ := $p$ + 1
    $\{r^{(p)}_{f,m,k}\}$ := E-step($\Theta^{(p-1)}$)
    $\Theta^{(p)}$ := M-step($\{r^{(p)}_{f,m,k}\}_{f,m,k}$)
end while
$M_k$ := $(k == \arg\max_{k'} r^{(p)}_{f,m,k'})_{f=1,m=1}^{F,M_f}$
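The last line of Algorithm 1, which turns the final posteriors into binary masks, can be sketched as follows. The function name and the flattened layout (one row per significant observation, one column per source) are assumptions made for illustration.

```python
import numpy as np

def binary_masks(r):
    """Binary spectral masks from posteriors r of shape (n_obs, K).

    Returns a boolean array of shape (K, n_obs): entry (k, i) is True
    when source k has the highest posterior for observation i.
    """
    winners = r.argmax(axis=1)           # ties go to the lowest source index
    return np.arange(r.shape[1])[:, None] == winners[None, :]
```

Each observation is assigned to exactly one source, which is consistent with the W-disjoint orthogonality constraint.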


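Based on Bayes’ formula and on marginalization rules,
the E-step computes the current posterior probabilities conditioned
by the previously estimated parameters, i.e., (13):

$$r^{(p)}_{f,m,k} = \frac{\pi_{f,k}\, P(\alpha_{f,m}, \phi_{f,m} \mid z_{f,m,k}; \Theta^{(p-1)})}{\sum_{i=1}^{K} \pi_{f,i}\, P(\alpha_{f,m}, \phi_{f,m} \mid z_{f,m,i}; \Theta^{(p-1)})} \qquad (19)$$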
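A minimal sketch of the E-step is given below. It assumes that the log of the conditional likelihood (13) and the log of the priors have already been evaluated for every observation and source, and it normalizes them in the log domain for numerical stability. The function name and array layout are hypothetical.

```python
import numpy as np

def e_step(log_lik, log_prior):
    """Posterior probabilities of the data-to-source assignments.

    log_lik:   log of (13) for each observation and source, shape (n_obs, K)
    log_prior: log pi_{f,k} looked up for each observation, shape (n_obs, K)
    """
    log_post = log_lik + log_prior
    # Subtract the row maximum before exponentiating to avoid underflow
    log_post = log_post - log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)
```

The rows of the result sum to one, so each row can be read as a soft assignment of one frequency-time point to the K sources.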

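The expected complete-data log-likelihood can now be
written as:

$$Q(\Theta \mid \Theta^{(p-1)}) = E_{(Z \mid A, \Phi; \Theta^{(p-1)})}\left[\log P(A, \Phi, Z \mid \Theta)\right] = \sum_{f=1}^{F} \sum_{m=1}^{M_f} \sum_{k=1}^{K} r^{(p)}_{f,m,k} \log \pi_{f,k}\, P(\alpha_{f,m}, \phi_{f,m} \mid z_{f,m,k}; \Theta) \qquad (20)$$

The M-step maximizes (20) with respect to $\Theta$:

$$\Theta^{(p)} = \arg\max_{\Theta} Q(\Theta \mid \Theta^{(p-1)}) \qquad (21)$$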


By combining (13) with (20) the problem becomes equivalent
to minimizing, for each source k:

$$\sum_{f=1}^{F} \sum_{m=1}^{M_f} r^{(p)}_{f,m,k} \left( \log\frac{\sigma^2_{f,k}}{\pi_{f,k}} + \frac{(\alpha_{f,m} - \mu_f(x_k))^2}{\sigma^2_{f,k}} + \log\frac{\rho^2_{f,k}}{\pi_{f,k}} + \frac{\Delta(\phi_{f,m}, \xi_f(x_k))^2}{\rho^2_{f,k}} \right) \qquad (22)$$

which can be easily differentiated with respect to $\{\pi_{f,k}\}_f$,
$\{\sigma_{f,k}\}_f$ and $\{\rho_{f,k}\}_f$ to obtain closed-form expressions,
conditioned by $x_k$, for the optimal values of these parameters.
These expressions are then substituted in (22), which is evaluated
for all $x_k \in \mathcal{X}$ to find the optimal position $x_k$ and
deduce all the other optimal parameters. Interested readers
can find a detailed solution in [7].
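The exhaustive search over training positions can be sketched as follows for a single source. The closed-form expressions are given in [7] and are not reproduced in this paper; the sketch assumes the standard weighted maximum-likelihood variance estimates, which is an assumption of this example and not necessarily the authors' derivation. Under that assumption the quadratic terms of (22) become constant and only the log-variance terms depend on the candidate position. The function name, the variance floor and the padded (F, M) layout are hypothetical.

```python
import numpy as np

def m_step_position(r, alpha, phi, mu, xi, var_floor=1e-6):
    """Search the training position minimizing (22) for one source.

    r, alpha, phi: posteriors, ILD and IPD observations, shape (F, M)
                   (a zero weight in r marks an unused slot)
    mu, xi:        training mean ILD / IPD vectors, shape (N, F)
    """
    w = r.sum(axis=1) + 1e-12                        # weight per frequency, (F,)
    d_ild = alpha[None] - mu[:, :, None]             # residuals, (N, F, M)
    d_ipd = np.angle(np.exp(1j * (phi[None] - xi[:, :, None])))
    # Assumed closed form: weighted ML variances for every candidate position
    sigma2 = np.maximum((r[None] * d_ild ** 2).sum(axis=2) / w, var_floor)
    rho2 = np.maximum((r[None] * d_ipd ** 2).sum(axis=2) / w, var_floor)
    # Terms of (22) that still depend on the position after substitution
    cost = (w * (np.log(sigma2) + np.log(rho2))).sum(axis=1)    # (N,)
    n_best = int(cost.argmin())
    return n_best, sigma2[n_best], rho2[n_best]
```

The intermediate arrays have N x F x M entries, so this vectorized form is only practical for small problems; with N = 10,800 training positions the candidates would have to be processed in chunks.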

As already mentioned, EM converges to a local maximum
of the observed data log-likelihood function $\mathcal{L}$. However, the
non-injectivity of the interaural functions $\mu_f$ and $\xi_f$ leads to
a very large number of these maxima, especially when the
set of learned positions $\mathcal{X}$, i.e., section 3.1, is large. This
makes the algorithm very sensitive to initialization. A
common way to avoid being trapped in local maxima may be
to initialize the parameters at random, but such a strategy
cannot be directly applied here: first, because the cardinality
of the parameter set $\Theta$ is very large and second, because
there is no straightforward way to initialize the variances
$\sigma^2_{f,k}$ and $\rho^2_{f,k}$. Another possibility may be to randomly
initialize the source assignment variables Z and then proceed
with the M-step, but extensive experiments with simulated
data revealed that the algorithm very rarely converged to a
global maximum (in less than 0.1% of the cases). We therefore
decided to adopt a method that combines these two
initialization strategies by randomly perturbing both the
source locations and the source assignments.

This led us to develop a stochastic initialization procedure
similar to the stochastic EM (SEM) algorithm [5]. The
idea of exploiting stochasticity to escape from local maxima
is a commonly used principle in global optimization [26].
The SEM algorithm includes a stochastic step (S) in between
the E and the M steps, during which random samples
$R_{f,m,k} \in \{0,1\}$ are drawn from the posteriors $r_{f,m,k}$. These
samples are then used instead of the posterior probabilities
during the M-step. To initialize our algorithm, we first
set all the posterior probabilities to 1/K and then proceed
through the following step sequence: S M* E S M, where
the M*-step is a variation of the M-step in which the sources’
positions are drawn randomly from $\mathcal{X}$ instead of computing
$\{x_k\}_k$. Experiments with simulated data showed that this
technique converged to a global optimal solution in over 10%
of the cases. More advanced techniques may also be used to
improve the convergence rate, and are detailed in [7].
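The S-step can be sketched as drawing, for every observation, one source index according to its posterior and encoding it as a one-hot vector. The function name and the (n_obs, K) layout are assumptions; inverse-CDF sampling is used here simply because it vectorizes well.

```python
import numpy as np

def s_step(r, rng):
    """Draw one-hot assignments R in {0,1}^K from posteriors r, shape (n_obs, K)."""
    cdf = np.cumsum(r, axis=1)
    # Scale by the row total so rounding in the cumulative sum cannot overshoot
    u = rng.random((r.shape[0], 1)) * cdf[:, -1:]
    k = (u >= cdf).sum(axis=1)                       # sampled source index
    k = np.minimum(k, r.shape[1] - 1)                # guard against rounding
    return np.eye(r.shape[1], dtype=int)[k]
```

For the initialization described above, the first S-step would be called with `r = np.full((n_obs, K), 1.0 / K)`, which assigns every observation to a source uniformly at random.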

In practice, twenty stochastic initializations are used in
order to increase the chances of correct convergence, and
only the one providing the best log-likelihood after two iterations
is eventually iterated until the convergence criterion is
satisfied. The algorithm stops either when the log-likelihood
gain is less than 1%, or after $p_{\max}$ = 20 iterations. Indeed,
from two hundred simulated experiments (see section 5.2),
we concluded that, on average, the algorithm converges
in eleven iterations.

3.5 Algorithm Complexity 

Both the E- and S-steps are linear in the total number
of significant observations, $\sum_{f=1}^{F} M_f$, and in the number of
sound sources, K. The M-step is linear in the number
of frequency channels F and in the number N of source
locations available in the training set $\mathcal{X}$. In the case of
simulated data using one-second long signals composed of
two sources, the values of these parameters are: K = 2,
F = 1024, $\sum_{f=1}^{F} M_f$ = 70,000, and N = 10,800. With a
Matlab implementation executed on a 2.53 GHz Intel Xeon
processor, we obtained the following average running times:
49 ms (E), 1030 ms (S), 2520 ms (M) and 24 ms (M*). The
most time-consuming part of the algorithm is its initialization
(20 stochastic initializations iterated 2 times), which
takes 195.9 seconds, while each subsequent EM iteration
takes 2.569 seconds, of which 98% corresponds to selecting
all optimal $x_k$ in $\mathcal{X}$. Careful algorithm and software
optimization will, however, make it possible to obtain the realistic
execution times needed for human-robot interaction scenarios,
e.g., one to two seconds.

4. DATA ACQUISITION 

In order to build the training set introduced in section 3.1,
we developed a technique to learn a large number of sound
source locations in an entirely unsupervised and automated
way using the motor system of a binaural robot head. This
technique is initially inspired by the sensorimotor theories
of early development in psychology, suggesting that experiencing
the sensory consequences of voluntary motor actions
was necessary for an organism to learn the perception of
space [17]. In particular, [3] argued that naive organisms
such as humans and echo-locating bats could learn sound
localization based solely on acoustic inputs and their relation
to motor states. This idea was experimentally validated
using a robot system in [8].


Figure 3: A binaural head is placed onto an agile device that 
can perform precise and reproducible pan and tilt motions (left). 
The emitter (a loud-speaker) is placed in front of the robot head 
at approximately 2.7 meters (right). 

Sound acquisition is performed with a Sennheiser MKE
2002 acoustic dummy-head linked to a computer via a Behringer
ADA8000 Ultragain Pro-8 digital external sound card.
The head is mounted onto a robotic system with two rotational
degrees of freedom: a pan motion and a tilt motion
(see Fig. 3). This device was specifically designed to achieve
precise and reproducible movements. The emitter, a loud-speaker,
is placed approximately 2.7 meters ahead of the
robot, as shown in Fig. 3. The loud-speaker’s input and
the microphones’ outputs were handled by two synchronized
sound cards in order to simultaneously record and play. All
the experiments were carried out in real-world conditions,
i.e., a room with natural reverberations and background
noise due to computer fans. All the recordings are publicly
available¹. We believe that such a large audio-motor data
set has no equivalent today, and is therefore a contribution
in its own right.

Rather than placing the emitter at known 3D locations
around the robot, it was kept in a fixed reference position
while the robot recorded emitted sounds from different motor
states. Consequently, a sound source direction is directly
associated to a pan-tilt motor state $(\psi, \theta)$ rather than
a 3D point in space. A robot trained in such a way would
therefore be able to perform the head movement pointing toward
an emitting sound source in an entirely unsupervised
way, without needing the inverse kinematics, the distance
between microphones, or any other parameters.

¹ http://perception.inrialpes.fr/~Deleforge/CAMIL_Dataset/

Recordings were made from 10,800 uniformly spread motor
states: 180 pan rotations $\psi$ in the range [−180°, 180°]
(left-right) and 60 tilt rotations $\theta$ in the range [−60°, 60°]
(top-down). The associated set of 3D source positions $\mathcal{X} \subset \mathbb{R}^3$
could be deduced using the direct kinematic model of the
robot given in [8]. However, the speaker was located in the
far field of the head during experiments (> 1.8 meters), and
[18] showed that HRTFs mainly depend on the sound source
direction while the distance has less impact in that case.
That is why sound source locations will be expressed with
angles in the rest of the paper.

At each motor state, five binaural recordings of one second
each were made while the speaker emitted different sounds.
Sound 1 corresponds to white noise, and was used to build
the training set (section 3.1). Sounds 2, 3 and 4 form the
test set (see section 5.3), and correspond respectively to a
woman pronouncing “Bonjour !”, a man pronouncing “Un petit
café ?”, and a flute melody. Sound 0 corresponds to “room
silence” and was used to determine the perceived acoustic
level threshold $\epsilon_f$ during tests (see section 3.2).

5. EXPERIMENTS AND RESULTS 
5.1 Performance Evaluation 

Algorithm 1 was tested and validated using the database
described in the previous section. In all presented results,
a source is considered correctly localized if the algorithm
located it within 2° of absolute angular error in both
azimuth and elevation. For real mixtures (sections 5.3 and
5.4), the separation was evaluated using standard metrics,
namely the Source-to-Distortion Ratio (SDR) and the
Source-to-Interferences Ratio (SIR) proposed in [22]. These
metrics are based on the decomposition of the estimated signal
into the target signal, an error term due to the interfering
signals, and an error term due to unexplained artifacts.
A further term can be added to evaluate background noise
reduction, but this is outside the scope of this work. Once
the decomposition is obtained, the SDR is defined as the ratio
of energy between the target and the sum of the error terms,
while the SIR is the ratio of energy between the target and the
interference term only. Both metrics are expressed in
decibels. As they are computed from 1D signals, the masked
left and right microphone spectrograms were converted back
to temporal signals using the inverse FFT, and concatenated
to evaluate the quality of binaural-based separation.
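
As a minimal sketch of the two ratios, assuming the decomposition
into target, interference and artifact components is already
available (the decomposition itself is defined in [22] and is not
reproduced here), the scores could be computed as follows. The
function names and the toy values are illustrative only and are
not taken from the original implementation.

```python
import math


def energy(signal):
    """Sum of squared samples of a 1D signal."""
    return sum(x * x for x in signal)


def sdr_db(target, interference, artifacts):
    """Source-to-Distortion Ratio: target energy over total error energy (dB)."""
    error = [i + a for i, a in zip(interference, artifacts)]
    return 10.0 * math.log10(energy(target) / energy(error))


def sir_db(target, interference):
    """Source-to-Interferences Ratio: target energy over interference energy (dB)."""
    return 10.0 * math.log10(energy(target) / energy(interference))


# Toy decomposition of an estimated signal (hypothetical values).
target = [1.0, -1.0, 1.0, -1.0]
interference = [0.1, 0.1, -0.1, -0.1]
artifacts = [0.1, -0.1, 0.1, -0.1]

print("SDR (dB):", round(sdr_db(target, interference, artifacts), 2))
print("SIR (dB):", round(sir_db(target, interference), 2))
```

Because the SDR denominator contains both error terms, a method
that suppresses the interferer but introduces masking artifacts
can reach a high SIR while keeping a modest SDR.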

SDR and SIR scores obtained with our algorithm are given
together with upper and lower bounds. The upper bound
corresponds to the SDR and SIR of the ground truth mask, or
oracle mask [25], which is set to 1 at every spectrogram point
at which the target signal is at least as loud as the combined
other signals, and to 0 everywhere else. The lower bound
corresponds to the SDR and SIR of the original mixture. The
level mask described in section 3.2 was applied to all signals
so that only the separation of significant observations was
evaluated.
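
The oracle mask rule can be sketched as below. This is an
illustration under stated assumptions: spectrograms are stored as
nested lists indexed by [frequency][time], "loudness" is taken to
be power, and the combined level of the other signals is taken to
be the sum of their powers. All names are hypothetical.

```python
def oracle_mask(target_power, other_powers):
    """Binary mask: 1 where the target is at least as loud as the others combined.

    target_power: 2D list [frequency][time] with the target source's power.
    other_powers: list of 2D lists of the same shape, one per interfering source.
    """
    mask = []
    for f, row in enumerate(target_power):
        mask_row = []
        for m, power in enumerate(row):
            # Assumption: combined level = sum of the interferers' powers.
            combined = sum(other[f][m] for other in other_powers)
            mask_row.append(1 if power >= combined else 0)
        mask.append(mask_row)
    return mask


def apply_mask(spectrogram, mask):
    """Zero out the spectrogram points rejected by the binary mask."""
    return [[value * bit for value, bit in zip(s_row, m_row)]
            for s_row, m_row in zip(spectrogram, mask)]


# Toy 2x2 power spectrograms (hypothetical values).
target = [[4.0, 1.0], [0.5, 2.0]]
other_a = [[1.0, 2.0], [0.5, 3.0]]
other_b = [[2.0, 0.0], [0.25, 0.0]]

print(oracle_mask(target, [other_a]))
print(oracle_mask(target, [other_a, other_b]))
```

Adding a second interferer can only turn mask entries from 1 to 0,
which illustrates why fewer points remain attributable to each
source as the number of sources grows.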

For tests made on simulated spectrograms, since there is
no 1D signal to compare with, the separation was evaluated
by a pointwise comparison of the estimated and oracle
binary masks: we define the mask error as the fraction of points
at which the estimated mask differs from the oracle mask.
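
A minimal sketch of this mask error, assuming both masks are
stored as nested lists of 0/1 values with one row per frequency
channel (rows may have different lengths, since the number of
significant observations varies per channel). The function name
is illustrative.

```python
def mask_error(estimated, oracle):
    """Fraction of spectrogram points at which two binary masks disagree."""
    total = 0
    differing = 0
    for est_row, ora_row in zip(estimated, oracle):
        for e, o in zip(est_row, ora_row):
            total += 1
            differing += e ^ o  # exclusive-or is 1 exactly when the bits differ
    return differing / total


# Toy masks with a ragged layout (hypothetical values).
estimated = [[1, 0, 1], [0, 0]]
oracle = [[1, 1, 1], [0, 1]]

print(mask_error(estimated, oracle))
```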

5.2 Simulated Data 

To validate our model, we first tested the algorithm on
simulated data. We simulated ILD and IPD spectrograms
with $F = 1024$ frequency channels from 0 to 12,000 Hz, and
10 to 126 significant observations per channel ($M_f$ randomly
drawn for each $f$). First, $K$ source positions
$x_1, \dots, x_K$ are randomly drawn from X. Then, each
spectrogram point $(f, m)$ is randomly assigned to one of the
$K$ sources, and we generate ILD and IPD observations
$\alpha_{f,m}$ and $\phi_{f,m}$ using the distributions
$\mathcal{N}(\mu_f(x_k), \sigma_f^2(x_k))$ and
$\mathcal{N}(\xi_f(x_k), \rho_f^2(x_k))$,
where $\sigma_f^2(x_k)$ and $\rho_f^2(x_k)$ correspond to the ILD
and IPD variances of white noise emitted from $x_k$, and are
estimated from the dataset. A total of 200 mixtures composed of
2 to 5 sources were generated, and $K$ was set to the correct
number of sources for each test. Table 1 shows the percentage
of correctly localized sources, the mean mask error obtained
over all sources, and the mean mask error obtained with
randomly generated binary masks.
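
The generation procedure can be sketched as follows. This is an
illustration, not the original code: the data layout, the function
name, the toy statistics and the choice of drawing distinct
positions are all assumptions, and the IPD samples are left
unwrapped for simplicity.

```python
import math
import random


def simulate_mixture(k, positions, ild_stats, ipd_stats, num_channels,
                     min_obs=10, max_obs=126, seed=0):
    """Draw K source positions, then label and sample every spectrogram point.

    ild_stats[x][f] and ipd_stats[x][f] are (mean, variance) pairs for
    position x and frequency channel f (hypothetical layout).
    Returns the drawn positions and, for each channel, a list of
    (source_index, ild, ipd) tuples.
    """
    rng = random.Random(seed)
    sources = rng.sample(positions, k)        # assumption: K distinct positions
    spectrogram = []
    for f in range(num_channels):
        m_f = rng.randint(min_obs, max_obs)   # M_f drawn for each channel f
        channel = []
        for _ in range(m_f):
            idx = rng.randrange(k)            # assign the point to one source
            x = sources[idx]
            mu, var_ild = ild_stats[x][f]
            xi, var_ipd = ipd_stats[x][f]
            # random.gauss expects a standard deviation, hence the sqrt.
            ild = rng.gauss(mu, math.sqrt(var_ild))
            ipd = rng.gauss(xi, math.sqrt(var_ipd))
            channel.append((idx, ild, ipd))
        spectrogram.append(channel)
    return sources, spectrogram


# Toy position set and statistics (hypothetical values, 4 channels only).
positions = [(-20, 0), (0, 0), (20, 0), (0, 10)]
num_channels = 4
ild_stats = {x: [(0.1 * x[0], 1.0)] * num_channels for x in positions}
ipd_stats = {x: [(0.01 * x[0], 0.25)] * num_channels for x in positions}

sources, spec = simulate_mixture(2, positions, ild_stats, ipd_stats, num_channels)
print(sources, [len(channel) for channel in spec])
```

The source index stored with each point plays the role of the
ground truth assignment, from which an oracle mask for each source
can be read off directly.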


Table 1: Mean localization and separation results for 200
simulated mixtures of 2 to 5 sources.

K   Correctly Localized (%)   Estimated Mask Error (%)   Random Mask Error (%)
2   99.2                      20.7                       50.0
3   91.2                      25.4                       55.6
4   54.1                      28.9                       62.5
5   14.4                      29.3                       68.0


These results show that our model is well suited to
both separation and localization tasks. A very high localization
accuracy is achieved for mixtures composed of 2 and 3
sources (99.2% and 91.2%, respectively). When the number of
sources is higher, the number of observations per source
decreases while the number of local maxima of the log-likelihood
function increases, leading to poorer results.

5.3 Real Mixtures from Learned Positions 

Three sounds were used to test the algorithm with real
data: male speech, female speech, and a flute melody (see
section 4). Mixtures were obtained by summing raw binaural
signals from the dataset, corresponding to randomly drawn
positions in X. ILD and IPD spectrograms were computed
from these mixtures, and only significant observations,
corresponding to points with a higher acoustic level than the
corresponding background mixtures, were kept, as described
in section 2.

First, 200 mixtures for each possible pair of test sounds
were generated, resulting in 1,200 localization-separation
tasks. Mean results obtained with our method for all these
tasks, as well as mean results for correctly localized sources
only, are shown in Table 2. SDR and SIR scores are compared
to those of the original mixture and of the oracle mask.
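
The localization criterion used throughout the experiments can be
sketched as below, assuming positions are (azimuth, elevation)
pairs in degrees that do not wrap around. The strict inequality at
the 2° boundary and all names are assumptions made for
illustration.

```python
def is_correctly_localized(estimate, truth, tolerance_deg=2.0):
    """True if both the azimuth and the elevation errors are below the tolerance."""
    azimuth_error = abs(estimate[0] - truth[0])
    elevation_error = abs(estimate[1] - truth[1])
    return azimuth_error < tolerance_deg and elevation_error < tolerance_deg


def localization_rate(estimates, truths):
    """Percentage of sources localized within the tolerance."""
    hits = sum(is_correctly_localized(e, t) for e, t in zip(estimates, truths))
    return 100.0 * hits / len(truths)


# Toy (azimuth, elevation) pairs in degrees (hypothetical values).
estimates = [(10.0, 0.0), (-31.0, 5.0), (40.0, 12.0)]
truths = [(10.0, 1.0), (-30.0, 5.0), (45.0, 12.0)]

print(round(localization_rate(estimates, truths), 1))
```

Reporting scores for the correctly localized subset only, as in
the tables, amounts to filtering the tasks with
`is_correctly_localized` before averaging.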

A very high localization rate is obtained, with 84.6% of
the 1,200 test sound sources localized with less than 2° error
in both azimuth and elevation. In addition, significant
improvements of SDR and SIR over the original mixtures
are achieved. This is particularly true for the SIR,
which almost reached the oracle value in some tests. This shows
that although our algorithm generates some artifacts in the
target sound signal during the spectral masking process, it is
able to considerably reduce the volume of the interferer and
thus significantly improve the perceived quality. Finally, we
note that SDR and SIR are always better when the
algorithm could correctly localize the sound source, which
demonstrates the importance of localization cues for sound
source separation.


Table 2: Mean angle error, SDR (Source-to-Distortion Ratio)
and SIR (Source-to-Interferences Ratio) for real mixtures of 2
sound sources.

Method             Angular Error (°)   SDR (dB)   SIR (dB)
Original           -                   0.05       0.05
Us (All sources)   11.5                4.11       14.3
Us (Loc: 84.6%)    0.18                5.17       15.2
Oracle             -                   17.9       39.6

A second experiment was made with 200 mixtures of the
three test sounds emitted simultaneously from random positions
in X, resulting in 600 localization-separation tasks. Results
are displayed in a similar fashion in Table 3.


Table 3: Mean angle error, SDR and SIR for real mixtures of 3
sound sources.

Method             Angular Error (°)   SDR (dB)   SIR (dB)
Original           -                   -3.61      -3.61
Us (All sources)   35.5                -4.03      3.21
Us (Loc: 45.2%)    0.18                0.09       7.29
Oracle             -                   14.5       32.7


Although performance decreases significantly for this very
challenging task (a 1 s mixture of 3 equally loud sounds is
hard to decipher even for humans), we note that the algorithm
could still accurately localize almost half of the 600
individual sources, while significantly improving their SDR
and SIR with respect to the original mixture.

5.4 Real Mixtures from Unlearned Positions 

The last experiment is also the most challenging one: we
tested the ability of our algorithm to separate
sound sources emitting from positions outside the training
set. More precisely, we made several test recordings in
which two loudspeakers were simultaneously emitting different
sounds while the robot stayed in its reference position
(0°, 0°). One of the loudspeakers was placed in a frontal
position corresponding to the training dataset, while the second
one was manually placed in 21 side positions around
the robot, with a 10° to 90° azimuth distance and a 0° to 30°
elevation distance. These experiments were repeated for the
six possible pairs of test sounds, resulting in 252
separation-localization tasks. The frontal loudspeaker was
correctly localized at (0°, 0°) in 95.2% of the mixtures tested.
Localization performance for the side loudspeaker was not
evaluated, as no ground truth was available. Table 4 shows
the mean SDR and SIR scores of each source obtained with
our approach, the original mixture, and the oracle mask.
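
The task count can be checked with a short computation, under
the assumption that the "six possible pairs" are the ordered
(frontal, side) assignments of the three test sounds and that
each mixture yields one task per loudspeaker. The labels below are
illustrative.

```python
from itertools import permutations

test_sounds = ["female speech", "male speech", "flute"]
num_side_positions = 21

# Ordered pairs: which sound is played in front and which on the side.
ordered_pairs = list(permutations(test_sounds, 2))

num_mixtures = num_side_positions * len(ordered_pairs)
num_tasks = 2 * num_mixtures  # one task per loudspeaker in each mixture

print(len(ordered_pairs), num_mixtures, num_tasks)
```

Under the same assumption, a 95.2% frontal localization rate
corresponds to roughly 120 of the 126 mixtures.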

Despite a decrease in performance with respect to sound
mixtures from learned positions, the significant improvement
of SIR obtained in realistic cocktail-party-like scenarios
puts forward our approach as a promising tool for auditory
human-robot interaction.

Table 4: Mean SDR and SIR of frontal and side speakers

Method     Frontal SDR (dB)   Frontal SIR (dB)   Side SDR (dB)   Side SIR (dB)
Original   -0.28              -0.28              0.32            0.32
Us         -1.09              3.01               -3.30           5.69
Oracle     18.29              40.64              18.63           40.13


6. CONCLUSION AND FUTURE WORK 

Traditionally, computational auditory scene analysis has been
addressed with either a close-range microphone or a microphone
array, and using simulated or anechoic audio data. We
propose to bridge the gap between constrained and unconstrained
audio analysis and to apply CASA to HRI. We propose
a system integrating the audio-motor abilities of a robot
within a unified framework, and performing sound-source
separation and localization in a realistic cocktail-party-like
scenario. By providing a genuine audio-motor database and
presenting encouraging results obtained from these data, we
provide a benchmark for the unexplored field of sensorimotor
learning for robot audition.

One of the most interesting and promising directions will
be to extend our model to a continuous space of sound source
positions. This could be done using the manifold structure
of interaural parameters studied in detail in [8]. By
approximating this manifold by local tangent spaces, the size of
the training set could be considerably reduced, thus speeding
up the M-step, while improving the localization of sound
sources from unknown places. Dynamic models incorporating
moving sound sources and head movements could also
be included based on this idea. Finally, a “garbage” source
class could be added to our model in order to better deal
with background and non-point sources. We believe that
these ideas, combined with careful algorithm and software
optimization, could lead to a novel robot hearing paradigm
within the emerging field of human-robot interaction.